Interactions with MLDB occurs via a REST API. Interacting with a REST API over HTTP from a Notebook interface can be a little bit laborious if you're using a general-purpose Python library like requests
directly, so MLDB comes with a Python library called pymldb
to ease the pain.
pymldb
does this in three ways:
BatFrame
class: this is a class that behaves like the Pandas DataFrame but offloads computation to the server via HTTP calls. This tutorial shows you how to use it.Resource
class: this is simple class which wraps the requests
library so as to make HTTP calls to the MLDB API more friendly in a Notebook environment. Check out the Resource Wrapper Tutorial for more info on the Resource
class.%mldb
magics: these are Jupyter line- and cell-magic commands which allow you to make raw HTTP calls to MLDB, and also provides some higher-level functions. Check out the Cell magic Tutorial for more info on the %mldb
magic system.The pymldb
library includes a system similar to the pandas
DataFrame called the BatFrame. If you are familiar with Pandas' DataFrame then you will feel right at home. The BatFrame allows you to interact with the database in a familiar syntax. Every dataset can be conceptualised as a matrix of values like any SQL tables, but with the addition of a time component. For now we will leave the time component outside of the BatFrame exploration and focus on the well know row/column format of a table.
This tutorial will walk you through the various features of the BatFrame.
We'll use the %mldb
magic commands to interact with the REST API to delete the dataset if it's already been loaded, and then run a remote Python script to load the data. See the Cell Magic Tutorial for more details or execute %mldb help
after running %reload_ext pymldb
.
In [1]:
%reload_ext pymldb
%mldb DELETE /v1/datasets/superheroes_and_villains
%mldb py http://opensource.datacratic.com/mldb-demo-resources/import_superheroes_and_villains.py
In [2]:
from pymldb.data import BatFrame
The BatFrame is instantiated with the address of a dataset. Here we instantiate one on the dataset that was created above. Pretty straight forward.
In [3]:
bf = BatFrame("http://localhost/v1/datasets/superheroes_and_villains")
In [4]:
type(bf)
Out[4]:
In [5]:
bf
Out[5]:
We can first get a list of the available columns with
In [6]:
bf.columns
Out[6]:
Just like you would expect, you use the square brackets with the name of the column.
In [7]:
bf["Skin"]
Out[7]:
Selecting one column will return a Column object that can later be used in other calculations.
In [8]:
height = bf["Height"]
type(height)
Out[8]:
Or a list of columns. The columns don't need to be consecutive. It is totally independent of the order the might have been inserted in the dataset. So imagine for example that you wanted to print the values of a few columns, you can do so in whatever order you want.
In [9]:
bf[["Gender", "Height", "Eyes"]]
Out[9]:
Similar to the column slicing and indexing, you can view rows via the ix property of the batframe
In [10]:
bf.ix["Batman"]
Out[10]:
In [11]:
bf.ix[["Batman", "Superman"]]
Out[11]:
You can also select portions of you data by slicing the rows and column via the ix property.
In [12]:
bf.ix["Batman", "Height"]
Out[12]:
If your columns contain integers or float, you can easily do binary arithmetics on them. For example, the Height column is in inches. Let's say you wanted it in meters instead
You can do any of these:
- Addition (+)
- Substraction (-)
- Multiplication (*)
- Division (/)
- Modulo (%)
- Power (**)
In [13]:
height = (bf["Height"] * 2.54)/100
height
Out[13]:
Of if you wanted the Weight in kilograms instead of pounds
In [14]:
weight = bf["Weight"] / 2.2
weight
Out[14]:
For example, let's calculate the BMI (Body Mass Index) of every superheroes and villains at our disposal
In [15]:
bmi = weight / (height**2)
bmi
Out[15]:
And as you would expect, you cannot divide by zero
In [16]:
import traceback
try:
bf["Weight"] / 0
except ValueError:
print traceback.format_exc()
Negate a column
In [17]:
-bf["Height"]
Out[17]:
Using the comparison operators on a column will return a Query object that is used by either a column or a BatFrame to filter out the values
You can do any of these:
- Greater than (>)
- Greater than or equal (>=)
- Lesser than (<)
- Lesser than or equal (<=)
- Equal (==)
- Not equal(!=)
In [18]:
type(bf["Height"] > 70)
Out[18]:
In [19]:
bf[bf["Height"] > 70]
Out[19]:
Here we have the columns Good and Bad which contains boolean values. You can filter your BatFrame based on that as well
In [20]:
bf[bf["Good"]]
Out[20]:
You can even combine boolean columns if you wish
In [21]:
bf[bf["Good"] & bf["Bad"]]
Out[21]:
And we see that only Catwoman is both good and bad because she is at times a thief and at other times romantically involved with Batman and helping him.
Or you can do things a bit more complicated like who are the Superheroes (Good and not Bad) that are at least 73 inches or shorter than 70.
In [22]:
bf[(bf["Good"] & ~bf["Bad"]) & ((bf["Height"] >= 73) | (bf["Height"] < 70))]
Out[22]:
Getting the min/max value of a column
In [23]:
bf["Height"].min()
Out[23]:
In [24]:
bf["Weight"].min()
Out[24]:
Printing only the head of the dataset because it can be too big to fit in memory. This works on both BatFrame and Column objects
In [25]:
bf.head()
Out[25]:
In [26]:
bf["Eyes"].head()
Out[26]:
Sometimes you just want the unique values that a column contains to see, for example, the range of values
In [27]:
bf["Eyes"].unique()
Out[27]:
Both the BatFrame and Column can be sorted. Let's start with the simplest one, Column.
In [28]:
# sort takes an optional argument ascending. By default it's True.
# Puting it at False would return the values in descending order
bf["Height"].sort()
Out[28]:
The BatFrame offers a bit more versatilty. You can sort on one particular column
In [29]:
bf.sort("Height")
Out[29]:
Or a list of columns. The batframe will be ordered based on the order of the list
In [30]:
bf.sort(["Height", "Eyes"])
Out[30]:
You can ask for it to be sorted in reverse order (aka descending order)
In [31]:
bf.sort(["Height", "Weight"], ascending=False)
Out[31]:
Let's say you wanted to sort your data in different order for each column, then just provide a list of booleans to the parameter ascending and it will do just that
In [32]:
bf.sort(["Height", "Weight"], ascending=[False, True])
Out[32]:
The characters are firstly order (ascending=False) on Height. Then, among those who are 67 inches high (for example), they are ordered from lightest to heaviest (ascending=True)
In [33]:
%pylab inline
In [34]:
from bokeh.plotting import *
from bokeh.charts import Histogram
from bokeh.models import HoverTool
In [35]:
output_notebook()
In [36]:
from pylab import get_cmap
def color_picker(number, cmap='Paired'):
cm = get_cmap(cmap)
colors = []
for i in range(number):
rgba = cm(1.*i/number)
color = "#"
for hue in rgba[:3]:
f = format(int(hue*255),'x')
if len(f) == 1:
f = '0' + f
color += f
colors.append(color)
return colors
Calling the values attribute on the columns will return a numpy array of the values only.
In [37]:
height = (bf["Height"] * 2.54)/100
height_val = height.values
weight = bf["Weight"]/2.2
weight_val = weight.values
bmi = weight/height**2
bmi_val = bmi.values
In [38]:
p = figure(plot_width=800, plot_height=800, title="Justice League", background_fill="#E5E5E5",
tools="hover")
source = ColumnDataSource(
data=dict(
height = height_val,
weight = weight_val,
hero=bf.rows,
bmi = bmi_val
)
)
male, female = color_picker(2)
colors = [male if gender=="Male" else female for gender in bf["Gender"]]
p.circle('height', 'weight', source=source, radius=5, radius_units="screen", color=colors)
hover = p.select(dict(type=HoverTool))
hover.tooltips = [
('Hero', '@hero'),
('Height', '@height'),
('Weight', '@weight'),
('BMI', '@bmi')
]
p.xaxis.axis_label="Height (Meter)"
p.yaxis.axis_label="Weight (Kg)"
show(p)
Because we also love and respect Pandas, we also provide a way to export the BatFrame to a DataFrame
In [39]:
df = bf.toPandas()
In [40]:
type(df)
Out[40]:
Now that you've seen the basics, check out the Analyzing the Iris Flower Dataset demo to see how to use the BatFrame with MLDB.
In [ ]: